Evidence of episodic positive selection in Corynebacterium diphtheriae complex of species and its implications in identification of drug and vaccine targets

Background Within the genus Corynebacterium, six pathogenic species that can produce diphtheria toxin (C. belfantii, C. diphtheriae, C. pseudotuberculosis, C. rouxii, C. silvaticum and C. ulcerans) form a clade referred to as the C. diphtheriae complex. These species have been found in humans and other animals, in which they cause diphtheria or other diseases. Here we report a genome-scale analysis that identifies positive selection in protein-coding genes that may have contributed to the adaptation of these species to their ecological niches, and we suggest drug and vaccine targets.

Methods Forty genomes were sampled to represent species, subspecies or biovars of Corynebacterium. Ten phylogenetic groups, including species and biovars from the C. diphtheriae complex, were tested for positive selection using the PosiGene pipeline. The detected genes were tested for recombination, and their sequence alignments and homology were examined manually. The final set of genes was investigated for function and for a probable role as vaccine or drug targets.

Results Nineteen genes were detected in the species C. diphtheriae (two), C. pseudotuberculosis (10), C. rouxii (one), and C. ulcerans (six). These genes are involved in defense, translation, energy production, and transport, and in the metabolism of carbohydrates, amino acids, nucleotides, and coenzymes. Fourteen were identified as essential genes and six as virulence factors. Thirteen of the 19 genes were identified as potential drug targets and four as potential vaccine candidates. These genes could be important in the prevention and treatment of the diseases caused by these bacteria.

The genome sequence data were uploaded to the Type (Strain) Genome Server (TYGS), a free bioinformatics platform available at https://tygs.dsmz.de, for a whole-genome-based taxonomic analysis [1]. The results were provided by the TYGS on 2021-02-09. The TYGS analysis was subdivided into the following steps:

Determination of closely related type strains
The closest type strain genomes were determined in two complementary ways. First, all user genomes were compared against all type strain genomes available in the TYGS database via the MASH algorithm, a fast approximation of intergenomic relatedness [2], and the ten type strains with the smallest MASH distances were chosen per user genome. Second, an additional set of ten closely related type strains was determined via the 16S rDNA gene sequences. These were extracted from the user genomes using RNAmmer [3], and each sequence was subsequently BLASTed [4] against the 16S rDNA gene sequence of each of the currently 14130 type strains available in the TYGS database. This was used as a proxy to find the best 50 matching type strains (according to the bitscore) for each user genome and to subsequently calculate precise distances using the Genome BLAST Distance Phylogeny approach (GBDP) under the algorithm 'coverage' and distance formula d5 [5]. These distances were finally used to determine the 10 closest type strain genomes for each of the user genomes.
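The two selection routes described above can be illustrated with a minimal Python sketch. It is not the TYGS implementation: the function names, the dictionary inputs, and all numeric values are hypothetical, and the distances and bitscores are assumed to have been computed beforehand by MASH, BLAST and GBDP.

```python
import heapq

def closest_by_distance(distances, k=10):
    """Return the k type strains with the smallest distance to one query genome.

    `distances` maps a type-strain name to a precomputed distance (hypothetical input).
    """
    return heapq.nsmallest(k, distances, key=lambda name: distances[name])

def closest_via_16s(bitscores, gbdp_distances, n_candidates=50, k=10):
    """Two-stage selection: shortlist by 16S bitscore, then rank by GBDP distance.

    `bitscores` and `gbdp_distances` are hypothetical precomputed dictionaries
    keyed by type-strain name.
    """
    # Stage 1: keep the n_candidates type strains with the highest bitscore.
    candidates = heapq.nlargest(n_candidates, bitscores, key=lambda name: bitscores[name])
    # Stage 2: among those candidates, keep the k with the smallest GBDP distance.
    return heapq.nsmallest(k, candidates, key=lambda name: gbdp_distances[name])

# Made-up values for illustration only.
mash = {"type_A": 0.02, "type_B": 0.15, "type_C": 0.07, "type_D": 0.30}
bitscores = {"type_A": 2700, "type_B": 2500, "type_C": 2650, "type_D": 2100}
gbdp = {"type_A": 0.05, "type_B": 0.01, "type_C": 0.20, "type_D": 0.001}

print(closest_by_distance(mash, k=2))
print(closest_via_16s(bitscores, gbdp, n_candidates=3, k=2))
```

In the second call, type_D is never considered even though its toy GBDP distance is the smallest, because it does not pass the bitscore shortlist. This is the sense in which the 16S comparison acts as a proxy.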

Pairwise comparison of genome sequences
For the phylogenomic inference, all pairwise comparisons among the set of genomes were conducted using GBDP, and accurate intergenomic distances were inferred under the algorithm 'trimming' and distance formula d5 [5]. One hundred distance replicates were calculated for each comparison. Digital DDH (dDDH) values and confidence intervals were calculated using the recommended settings of the GGDC 2.1 [5].

Phylogenetic inference
The resulting intergenomic distances were used to infer a balanced minimum evolution tree with branch support via FASTME 2.1.4 including SPR postprocessing [6]. Branch support was inferred from 100 pseudobootstrap replicates each. The trees were rooted at the midpoint [7] and visualized with PhyD3 [8].
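Midpoint rooting places the root halfway along the longest leaf-to-leaf path in the tree. The following Python sketch shows the idea on a toy tree. It is an illustration only, not the procedure of [7] or of the TYGS: the adjacency-dictionary representation, the function names and the branch lengths are assumptions made for this example.

```python
def farthest(adj, start):
    """Return (node, distance, parent map) for the node farthest from `start`."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor, length in adj[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                parent[neighbor] = node
                stack.append(neighbor)
    far = max(dist, key=dist.get)
    return far, dist[far], parent

def midpoint_edge(adj):
    """Return (node, next_node, offset): the root lies `offset` from `node` on that edge."""
    # Two traversals find the ends of the longest path in a tree.
    a, _, _ = farthest(adj, next(iter(adj)))
    b, longest, parent = farthest(adj, a)
    half = longest / 2
    walked = 0.0
    node = b
    # Walk from b back towards a until half of the longest path is covered.
    while parent[node] is not None:
        length = adj[node][parent[node]]
        if walked + length >= half:
            return node, parent[node], half - walked
        walked += length
        node = parent[node]
    raise ValueError("tree needs at least one branch")

# Toy unrooted tree with made-up branch lengths: leaves A, B, C joined at X.
tree = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}
print(midpoint_edge(tree))
```

For this toy tree the longest path runs from B to C (length 5.0), so the root falls on the branch between X and B, 0.5 away from X and 2.5 away from both B and C.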

Type-based species and subspecies clustering
The type-based species clustering using a 70% dDDH radius around each of the 120 type strains was done as previously described [1]. The resulting groups are shown in Tables 1 and 4. Subspecies clustering was done using a 79% dDDH threshold as previously introduced [9].
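As a rough illustration of how the two thresholds act on a single query genome, the sketch below assigns a query to its best-matching type strain. It is a simplification rather than the clustering procedure of [1] and [9]: the function name, the input dictionary and the dDDH values are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
SPECIES_THRESHOLD = 70.0     # dDDH (%) radius for species, as stated in the text
SUBSPECIES_THRESHOLD = 79.0  # dDDH (%) threshold for subspecies, as stated in the text

def assign_query(ddh_to_type_strains):
    """Return (species_match, subspecies_match) for one query genome.

    `ddh_to_type_strains` maps a type-strain name to the dDDH (%) between that
    type strain and the query (hypothetical precomputed input). A match is None
    when the best dDDH value falls below the corresponding threshold.
    """
    best = max(ddh_to_type_strains, key=lambda name: ddh_to_type_strains[name])
    value = ddh_to_type_strains[best]
    # Assumption: boundary values count as inside the cluster.
    species = best if value >= SPECIES_THRESHOLD else None
    subspecies = best if value >= SUBSPECIES_THRESHOLD else None
    return species, subspecies

# Made-up dDDH values for illustration only.
print(assign_query({"type_A": 74.2, "type_B": 41.0}))  # same species, different subspecies
print(assign_query({"type_A": 85.0, "type_B": 41.0}))  # same species and subspecies
print(assign_query({"type_A": 55.3, "type_B": 41.0}))  # no species-level match
```

A query that matches no type strain at the species level, as in the last call, would correspond to a cluster without a type strain.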

Type-based species and subspecies clustering
The resulting species and subspecies clusters are listed in Table 4, whereas the taxonomic identification of the query strains is found in Table 1. Briefly, the clustering yielded 94 species clusters and the provided query strains were assigned to 20 of these. Moreover, user strains were located in 20 of 98 subspecies clusters.

Type-based species and subspecies clustering
The type-based species clustering using a 70% dDDH radius around each of the 103 type strains was done as previously described [1]. The resulting groups are shown in Tables 1 and 4. Subspecies clustering was done using a 79% dDDH threshold as previously introduced [9].

Type-based species and subspecies clustering
The resulting species and subspecies clusters are listed in Table 4, whereas the taxonomic identification of the query strains is found in Table 1. Briefly, the clustering yielded 85 species clusters and the provided query strains were assigned to 19 of these. Moreover, user strains were located in 19 of 87 subspecies clusters.

Type-based species and subspecies clustering
The type-based species clustering using a 70% dDDH radius around each of the 13 type strains was done as previously described [1]. The resulting groups are shown in Tables 1 and 4. Subspecies clustering was done using a 79% dDDH threshold as previously introduced [9].

Type-based species and subspecies clustering
The resulting species and subspecies clusters are listed in Table 4, whereas the taxonomic identification of the query strains is found in Table 1. Briefly, the clustering yielded 5 species clusters and the provided query strains were assigned to one of these. Moreover, user strains were located in one of 5 subspecies clusters.

Tree inferred with FastME 2.1.6.1 [6] from GBDP distances calculated from genome sequences. The branch lengths are scaled in terms of GBDP distance formula d5. The numbers above branches are GBDP pseudo-bootstrap support values > 60% from 100 replications, with an average branch support of 48.9%. The tree was rooted at the midpoint [7].